QC metrics have long been used to assess pipeline quality. Here we propose an analysis that evaluates pipeline updates/changes using Picard QC metrics.
The key input data for this test are two Picard QC metrics files, one from each pipeline.
First, load the functions we will use in this test.
Here is the list of input parameters.
library(optparse)  # make_option / parse_args
library(ggpubr)    # ggarrange (loads ggplot2)
source('analysis_functions.R')
## inputs, Python-argparse style
option_list <- list(
make_option(
"--metrics1",
type = "character",
default = NULL,
help = "First/Base metrics file name",
metavar = "character"
),
make_option(
"--metrics2",
type = "character",
default = NULL,
help = "Second/Updated metrics file name",
metavar = "character"
),
make_option(
"--output_prefix",
type = "character",
default = "out",
help = "output file name [default= %default]",
metavar = "character"
),
make_option(
"--metKeys",
type = "character",
default = NULL,
help = "A list of metrics to analyze",
metavar = "character"
)
)
# args rendered by rmarkdown
args<-strsplit(commandArgs(trailingOnly = TRUE),split=' ')
opt_parser <- OptionParser(option_list = option_list)
opt <- parse_args(opt_parser,args=args[[1]])
# load params
output_name <- opt$output_prefix
The QC metrics files are collections of Picard QC outputs, formatted as an N*M matrix: the rows are QC metrics and the columns are cells/samples.
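The loading step can be sketched as follows. This is a minimal sketch, assuming the metrics files are tab-delimited with QC metric names in the first column and cell IDs as column headers; `read_metrics` is a hypothetical helper, and the actual reader used in the pipeline may differ.

```r
## Hypothetical loader: tab-delimited file, metrics as rows, cells as
## columns (layout is an assumption, not confirmed by the pipeline docs).
read_metrics <- function(path) {
  as.matrix(read.table(path, header = TRUE, row.names = 1,
                       sep = "\t", check.names = FALSE))
}
# met1 <- read_metrics(opt$metrics1)
# met2 <- read_metrics(opt$metrics2)
```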
Here is the list of metrics we will run the analysis on.
First we match the column names (cell IDs) between the two pipelines' metrics.
colnames1 <- colnames(met1)
colnames2 <- colnames(met2)
mlist <- match(colnames1, colnames2)
nalist <- is.na(mlist)
if (sum(nalist) > 0) {
  stop("input files have different column elements")
}
met2 <- met2[, mlist]
Then select the subset of QC metrics to run the analysis on.
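The subsetting step can be sketched as follows. Treating `--metKeys` as a comma-separated string is an assumption; the `met1.core`/`met2.core` names follow the analysis loop later in this report.

```r
## Sketch: restrict both matrices to the requested metrics.
## Parsing --metKeys as comma-separated is an assumption.
metKeys <- strsplit(opt$metKeys, ",")[[1]]
keep <- intersect(metKeys, rownames(met1))
met1.core <- met1[keep, , drop = FALSE]
met2.core <- met2[keep, , drop = FALSE]
```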
Finally, we have two QC metrics to run analysis on.
To evaluate the pipeline changes/updates, we carry out statistical tests on each QC metric; these tests are described and visualized as follows:
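The tests behind the plotting wrappers can be sketched with base R. This is a sketch, assuming `x` holds the base values and `y` the updated values for one metric across cells; the actual statistics are computed inside `plotlm`, `plotBox`, and `plotKS`, whose internals are not shown here. A paired Wilcoxon test stands in for the box-plot comparison as a plausible choice.

```r
## Sketch of the per-metric tests: a linear fit y ~ x (slope, intercept,
## R^2, F-test p-value), a paired Wilcoxon test, and a two-sample
## Kolmogorov-Smirnov test comparing the two distributions.
compare_metric <- function(x, y) {
  fit <- lm(y ~ x)
  fs  <- summary(fit)$fstatistic
  ks  <- ks.test(x, y)
  list(beta  = unname(coef(fit)[2]),          # regression slope
       a     = unname(coef(fit)[1]),          # intercept
       r2    = summary(fit)$r.squared,        # goodness of fit
       fpval = unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE)),
       wilcox_p = wilcox.test(x, y, paired = TRUE)$p.value,
       ks_D  = unname(ks$statistic),          # max CDF distance
       ks_p  = ks$p.value)
}
```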
Let’s run the analysis mentioned above on the set of QC metrics.
out <- c()
pouts <- list()
for (ii in seq_along(metKeys)) {
print(metKeys[ii])
x <- as.numeric(met1.core[metKeys[ii], ])
y <- as.numeric(met2.core[metKeys[ii], ])
z <- data.frame('Base' = x, 'Updated' = y)
# linear regression model
p1 <- plotlm(z)
# plot histogram
p2 <- plothist(z)
# plot box
p3 <- plotBox(z)
# plot KS
p4 <- plotKS(z)
# arrange plots in one figure
gp <-
ggarrange(
p2$p,
p3$p,
p1$p,
p4$p,
labels = c("A", "B", "C", "D")
) # arrange 4 plots
gp <-
gp + ggtitle(paste(metKeys[ii])) + theme(plot.title = element_text(
hjust = 0.5,
size = 15,
face = 'bold'
))
print(gp)
out <-
rbind(out,
c(
metKeys[ii],
p1$beta,
p1$a,
p1$r2,
p1$fpval,
p3$pval,
p4$D,
p4$pval
)) # test, statistics outputs
}
## [1] "detected_ratio"
## [1] "MT_ratio"
## [1] "MEDIAN_INSERT_SIZE"
## [1] "MEAN_INSERT_SIZE"
## [1] "PCT_CHIMERAS"
## [1] "PCT_PF_READS_ALIGNED"
## [1] "PCT_PF_READS_IMPROPER_PAIRS"
## [1] "PCT_READS_ALIGNED_IN_PAIRS"
## [1] "PCT_CODING_BASES"
## [1] "PCT_INTERGENIC_BASES"
## [1] "PCT_INTRONIC_BASES"
## [1] "PCT_UTR_BASES"
## [1] "PCT_USABLE_BASES"
## [1] "PCT_MRNA_BASES"
## [1] "PCT_RIBOSOMAL_BASES"
## [1] "PERCENT_DUPLICATION"
## [1] "MEDIAN_5PRIME_TO_3PRIME_BIAS"
## [1] "MEDIAN_3PRIME_BIAS"
## [1] "MEDIAN_5PRIME_BIAS"
## [1] "MEDIAN_CV_COVERAGE"
## [1] "PF_MISMATCH_RATE"
## [1] "MEDIAN_INSERT_SIZE"
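Finally, the aggregated statistics can be labeled and saved. This is a sketch: the column names mirror the order of values rbind-ed in the loop above, and the output file name built from `--output_prefix` is an assumption.

```r
## Sketch: name the columns of the collected test statistics and write
## them out. The "_qc_comparison.csv" suffix is a hypothetical choice.
save_results <- function(out, output_name) {
  colnames(out) <- c("metric", "lm_slope", "lm_intercept", "lm_r2",
                     "lm_f_pval", "box_pval", "ks_D", "ks_pval")
  write.csv(out, paste0(output_name, "_qc_comparison.csv"),
            row.names = FALSE)
  invisible(out)
}
# save_results(out, output_name)
```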